Abstract. Although most business application data is stored in relational
databases, programming languages and wire formats in integration middleware
systems are not table-centric. Due to costly format conversions and data
shipments, and for the sake of faster computation, the trend is to “push down”
the integration operations closer to the storage representation. We address
the alternative case of defining declarative, table-centric integration
semantics within standard integration systems. For that, we replace the
current operator implementations for the well-known Enterprise Integration
Patterns by equivalent “in-memory” table processing, and show a practical
realization in a conventional integration system for a non-reliable,
“data-intensive” messaging example. The results of the runtime analysis show
that table-centric processing is promising already in standard,
“single-record” message routing and transformations, and can potentially
improve the message throughput for “multi-record” table messages.

Keywords: Datalog, Message-based / Data integration, Integration System. 

1 Introduction 

Integration middleware systems in the sense of EAI brokers (e.g., SAP
HANA Cloud Integration¹, Boomi AtomSphere²) address the fundamental need
for (business) application integration by acting as a messaging hub between
applications. As such, they have become ubiquitous in service-oriented enterprise
computing environments. Messages are mediated between applications mostly in
wire formats based on XML (e.g., SOAP for Web Services).

The advent of more “data-aware” integration scenarios (observation O1) put
emphasis on (near) “real-time” or online processing (O2), which requires us
to revisit the standard integration capabilities, system design and architectural
decisions. For instance, in the financial / utilities industry, China Mobile generates
5-8 TB of call detail records per day, which have to be processed by integration
systems (i.e., mostly message routing and transformation patterns), “convergent
charging” (CC) and “invoicing” applications (not further discussed). In addition,
the standard XML processing has to give ground to other formats like JSON
and CSV (O3). These observations (O1-3) are backed by similar scenarios from
sports management (e.g., online player tracking) and the rapidly growing amount
of data from the Internet of Things and Cyber Physical System domains. For
those scenarios, an architectural setup is used in which Message Queuing (MQ)
systems serve as reliable “message buffers” (i.e., queues, topics) that handle
“bursty” incoming messages and smooth peak loads (cf. Fig. 1). Integration systems
are used as message consumers, which (transactionally) dequeue, transform
(e.g., mapping, content enrichment) and route messages to applications. For
reliable transport and message provenance, integration systems require relational
Database Systems, in which most of the (business) application data is currently
stored (O4). When looking at the throughput capabilities of the named systems,
software-/hardware-based MQ systems like Apache Kafka or Solace³ are able to
process several millions of messages per second. RDBMS benchmarks like TPC-H
measure queries and inserts in PB sizes, while simple, header-based routing
benchmarks for integration systems show message throughputs of a few thousand
messages per second [2] (O5). In other words, MQ and DBMS (e.g., RDBMS,
NoSQL, NewSQL) systems are already addressing observations O1-5. Integration
systems, however, do not seem to be there yet.

1 SAP HCI, visited 04/2015: https://help.sap.com/cloudintegration

2 Boomi AtomSphere, visited 04/2015: http://www.boomi.com/integration

3 Solace Solutions, visited 02/2015; last update 2012: http://www.solacesystems.com/techblog/deconstructing-kafka

Compared to MQs, integration systems work on message data, which seems to
make the difference in message throughput. We argue that integration operations,
represented by Enterprise Integration Patterns (EIP) [9], can be mapped to an
“in-memory” representation of the table-centric RDBMS operations to profit
from their efficient and fast evaluation. Early ideas on this were brought up
in our position papers. In this work, we follow up to shed light on the
observed discrepancies. We revisit the EIP building blocks and operator model
of integration systems, for which we define RDBMS-like table operators (so far
without changing their semantics) as a symbiosis of RDBMS and integration
processing by using Datalog [16]. We choose Datalog as an example of an efficiently
computable, table-like integration co-processing facility close to the actual storage
representation with expressive foundations (e.g., recursion). We call the resulting
patterns Table-centric Integration Patterns (TIP). To show the applicability of our
approach to integration scenarios along observations O1-5, we conduct an experimental
message throughput analysis for selected routing and transformation patterns,
where we carefully embed the TIP definitions into the open-source integration
system Apache Camel [3] that implements most of the EIPs. Not changing the
EIP semantics means that table operations are executed on “single-record” table
messages. We give an outlook to “multi-record” table message processing.

The remainder of this paper is organized along its main contributions. After a
more comprehensive explanation of the motivating CC example and a brief sketch
of our approach in Sect. 2, we analyse common integration patterns with respect
to their extensibility for alternative operator models and define a table-centric
operator / processing model that can be embedded into the patterns (still) aligned
with their semantics, using for example Datalog, in Sect. 3. In Sect. 4 we apply our
approach to a conventional integration system, briefly describe and discuss
our experimental performance analysis, and draw an experimental sketch of
the idea of “multi-record” table message processing. Section 5 examines related
work and Sect. 6 concludes the paper.

2 Motivating Example and General Approach 

In this section, the motivating “Call Record Detail” example in the context of the
“Convergent Charging” application is described more comprehensively along with
a sketch of our approach. Figure 1 shows aspects of both as part of a common
integration system architecture.


[Figure 1: architecture diagram; recoverable labels: “Integration Operations (content-level)” and “Integration System (system-level)”]

Fig. 1. High-level overview of the convergent charging application and architecture.


2.1 The Convergent Charging Scenario 

Mobile service providers like China Mobile generate large amounts of “Call
Record Details” (CRDs) per day that have to be processed by applications like
SAP Convergent Charging (CC). As shown in Fig. 1, these CRDs are usually
sent from mobile devices to integration systems (optionally buffered in an MQ
system), where they are translated to an intermediate (application) format and
enriched with additional master data (e.g., business partner, product). The
master data helps to calculate pricing information, with which the message
is split into several messages, denoting billable items (i.e., items for billing)
that are routed to their receivers (e.g., DB). From there, applications like SAP
Convergent Invoicing generate legally binding payment documents. Alternatively,
new application and data analytics stacks like LogicBlox [7], WebdamLog, and
SAP S/4HANA⁵ (not shown) access the data for further processing. Some of these
“smart” stacks even provide a declarative, Datalog-like language for application
and user-interface programming, which complements our integration approach.
As motivated before, standard integration systems have problems processing the
high number and rate of incoming messages, which usually leads to “offline”,
multi-step processing using indirections like ETL systems and to pushing integration
logic to the applications, resulting in long-running CC runs.

4 SAP Convergent Charging, last visited 04/2015: https://help.sap.com/cc

5 SAP S/4HANA, last visited 04/2015: http://discover.sap.com/S4HANA


2.2 General Approach 

The Enterprise Integration Patterns (EIPs) [9] define “de-facto” standard operations
on the header (i.e., payload’s meta-data) and body (i.e., message payload)
of a message, which are normally implemented in the integration system’s host
language (e.g., Java, C#). This way, the actual integration operation (i.e., the
content developed by an integration expert like mapping programs and routing
conditions) can be differentiated from the implementation of the runtime system
that invokes the content operations and processes their results. For instance,
Fig. 1 shows the separation of concerns within integration systems with respect
to “system-related” and “content-related” parts and sketches which pattern
operations to re-define using relational table operators, while leaving the runtime
system (implementation) as is. The goal is to only change these operations and
make integration language additions for table-centric processing within the
conventional integration system, while preserving the general integration semantics
like Quality of Service (e.g., best effort, exactly once) and the Message Exchange
Pattern (e.g., one-way, two-way). In other words, the content-related parts of the
pattern definitions are evaluated by an “in-process” table operation processor
(e.g., a Datalog system), which is embedded into the standard integration system
and invoked during the message processing.


3 Table-centric Integration Patterns 

Before defining Table-centric Integration Patterns (short TIP) for message routing
and transformation more formally, let us recall the encoding of some relevant, basic
database operations / operators into Datalog: join, projection, union, and
selection. The join of two relations $r(x, y)$ and $s(y, z)$ on parameter $y$ is encoded
as $j(x, y, z) \leftarrow r(x, y), s(y, z)$, which projects all three parameters to the resulting
predicate $j$. More explicitly, a projection on parameter $x$ of relation $r(x, y)$ is
encoded as $p(x) \leftarrow r(x, y)$. The union of $r(x, y)$ and $s(x, y)$ is $u(x, y) \leftarrow r(x, y).$
$u(x, y) \leftarrow s(x, y)$, which combines several relations to one. The selection of $r(x, y)$
according to a built-in predicate $\phi(x)$, where $\phi(x)$ can contain constants and
free variables, is encoded as $s(x, y) \leftarrow r(x, y), \phi(x)$. Built-in predicates can be
binary relations on numbers such as $<$, $<=$, $=$, binary relations on strings such as
equals, contains, startswith, or predicates applied to expressions based on binary
operators (e.g., $x = p(y) + 1$), and operations on relations like
$z = \max(p(x, y), x)$ and $z = \min(p(x, y), x)$, which would assign the maximal or the
minimal value $x$ of a predicate $p$ to a parameter $z$.
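
The four encodings can be mimicked outside a Datalog engine. The following is a minimal sketch, assuming that relations are modelled as Python sets of tuples; the function names and the sample relations are illustrative only and are not part of any Datalog system.

```python
# Relations are modelled as Python sets of tuples (an assumption of this sketch).
r = {(1, "a"), (2, "b")}                      # r(x, y)
s = {("a", 10.0), ("b", 20.0), ("c", 30.0)}   # s(y, z)


def join(r, s):
    """j(x, y, z) <- r(x, y), s(y, z): match rows on the shared parameter y."""
    return {(x, y, z) for (x, y) in r for (y2, z) in s if y == y2}


def project_first(r):
    """p(x) <- r(x, y): keep only parameter x."""
    return {(x,) for (x, _y) in r}


def union(r, s):
    """u(x, y) <- r(x, y).  u(x, y) <- s(x, y)."""
    return r | s


def select(r, phi):
    """s(x, y) <- r(x, y), phi(x): keep rows whose x satisfies the built-in."""
    return {(x, y) for (x, y) in r if phi(x)}
```

Set semantics remove duplicate rows automatically, which matches how a derived Datalog predicate holds each fact at most once.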
Although our approach allows each single pattern definition to evaluate
arbitrary, recursive Datalog operations and built-in predicates, the Datalog to
pattern mapping tries to identify and focus on the most relevant table-centric
operations for a specific pattern. An overview of the mapping of all discussed
message routing and transformation operations to Datalog constructs is shown
in Fig. 2 and is discussed in the following. Subsequently, we enumerate common
EIPs and separate system- from content-related parts more formally for the TIP
definition by example of standard Datalog.

[Figure 2: matrix of patterns against Datalog operations; recoverable row labels: Message Routing (Router, Filter; Recipient List; Multicast, Join Router; Splitter; Correlation, Completion; Aggregation) and Message Transformation (Message translator; Content filter; Content enricher)]

Fig. 2. Message routing and transformation patterns mapped to Datalog. Most common
Datalog operations for a single pattern are marked “dark blue”, less common ones
“light blue”, and possible but uncommon ones “white”.


3.1 Canonical Data Model 

When connecting applications, various operations are executed on the transferred 
messages in a uniform way. The arriving message instances are converted into an 
internal format understood by the pattern implementation, called the Canonical 
Data Model (CDM) [9], before the messages are transformed to the target format. 
Hence, if a new application is added to the integration solution, only conversions 
between the CDM and the application format have to be created. Consequently, 
for a table-centric re-definition of integration patterns, we define a CDM similar 
to relational database tables as Datalog programs, which consists of a collection 
of facts / a table, optional (supporting) rules as message body and an optional 
set of meta-facts that describes the actual data as header. For instance, the 
data-part of an incoming message in JSON format is transformed to a collection 
of Open-Next-Close (ONC)-style table iterators, each representing a table row 
or fact. These ONC-operators are part of the evaluated execution plan for more 
efficient evaluation. 
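
To make the canonical data model more concrete, the following minimal sketch converts one JSON payload into a table-centric message with a single-row ONC iterator. The class `OncIterator`, the function `json_to_cdm` and the dictionary layout of the message are assumptions made for illustration; they are not the data structures of the system described here.

```python
import json


class OncIterator:
    """Open-Next-Close iterator over the rows (facts) of one table."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = None  # None means "not opened"

    def open(self):
        self._pos = 0

    def next(self):
        """Return the next row, or None when the table is exhausted."""
        if self._pos is None:
            raise RuntimeError("iterator is not open")
        if self._pos >= len(self._rows):
            return None
        row = self._rows[self._pos]
        self._pos += 1
        return row

    def close(self):
        self._pos = None


def json_to_cdm(payload, predicate):
    """Convert one JSON object into a table-centric message.

    The header carries a meta-fact naming the columns, the body carries the
    facts. This layout is an assumption of the sketch.
    """
    record = json.loads(payload)
    columns = tuple(record.keys())
    fact = tuple(record[c] for c in columns)
    return {
        "header": {(predicate, columns)},
        "body": {predicate: OncIterator([fact])},
    }
```

A consumer would call `open()`, pull rows with `next()` until it returns `None`, and then call `close()`, which is the usual ONC protocol of database execution plans.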
3.2 Message Routing Patterns

In this section the message routing pattern implementations are re-defined, which
can be seen as control and data flow definitions of an integration channel pipeline.
For that, they access the message to route it within the integration system and
eventually to its receiver(s). They influence the channel and message cardinality
as well as the content of the message.

Content-based Router / Message Filter The most common routing patterns that
determine the message’s route based on its body are the Content-based Router
and the Message Filter. The stateless router has a channel cardinality of 1:n,
where n is the number of leaving channels, while one channel enters the router,
and a message cardinality of 1:1. The entering message constitutes the leaving
message according to the evaluation of a routing condition. This condition is
a function $rc$, with $\{bool_1, bool_2, \ldots, bool_n\} := rc(msg_{in}, conds)$, where $msg_{in}$ is
the entering message. The function $rc$ evaluates to a list of Boolean outputs
$\{bool_1, bool_2, \ldots, bool_n\}$ based on a list of conditions $conds$ of the same arity (e.g.,
Datalog rules in Suppl. Material, List. 1.1) for each of the $n \in \mathbb{N}$ leaving channels.
In case several conditions evaluate to true, only the first matching channel
receives the message.

Through the separation of concerns, a system-level routing function provides
the entering message $msg_{in}$ to the content-level implementation (i.e., in CDM
representation), which is configured by $conds$. Since standard Datalog rules are
truth judgements, and hence do not directly produce Boolean values, we decided,
for performance and generality considerations, to add an additional function
$bool_{rc}$ to the integration system. The function $bool_{rc}$ converts the output list $fact$
of the routing function from a truth judgement to a Boolean by emitting true if
$fact \neq \emptyset$, and false otherwise. Accordingly, we define the TIP routing condition
as $fact := rc_{tip}(msg_{in}, conds)$, which is evaluated for each channel condition
(e.g., selection / built-in predicates). The integration system then uses the
function $bool_{rc}$ to convert this into a Boolean value. For the message filter, which
is a special case of the router that differs only in its channel cardinality of 1:1
and message cardinality of 1:[0|1], the filter condition is equal to $rc_{tip}$.
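
The interplay between the content-level condition and the system-level conversion can be sketched as follows. This is a minimal illustration, assuming that a message body is a set of tuples and that each channel condition is a Python predicate standing in for a Datalog rule; the function `route` and the tuple layout are hypothetical.

```python
def bool_rc(facts):
    """Map a truth judgement (a set of derived facts) to a Boolean."""
    return len(facts) > 0


def rc_tip(msg_in, cond):
    """Evaluate one channel condition; returns the derived facts (maybe empty)."""
    return {fact for fact in msg_in if cond(fact)}


def route(msg_in, conds):
    """Return the index of the first leaving channel whose condition holds.

    Returns None if no condition holds, which corresponds to a message
    filter dropping the message when only one condition is configured.
    """
    for channel, cond in enumerate(conds):
        if bool_rc(rc_tip(msg_in, cond)):
            return channel
    return None
```

Iterating the conditions in order and returning on the first hit reflects the rule that only the first matching channel receives the message.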

Multicast / Recipient List / Join Router The stateless Multicast and Recipient
List patterns route multiple messages to several leaving channels, which gives them
a message and channel cardinality of 1:n. While the multicast statically routes
messages to the leaving channels (i.e., no re-definition required), the recipient list
determines the channels dynamically. The receiver determination function $rd$, with
$\{out_1, out_2, \ldots, out_n\} := rd(msg_{in}, [header.y \mid body.x])$, computes $n \in \mathbb{N}$ receiver
channel configurations $\{recv_1, recv_2, \ldots, recv_n\}$ by extracting their key values either
from an arbitrary message header field $header.y$ or from a message body field
$body.x$. The integration system has to implement a receiver determination function
that takes the list of key-strings $\{recvId_1, recvId_2, \ldots, recvId_m\}$ as input, for
which it looks up receiver configurations $recv_1, recv_2, \ldots, recv_n$, where $m, n \in \mathbb{N}$
and $m > n$, and routes copies of the entering message $\{msg_{out}', msg_{out}'', \ldots, msg_{out}^{n'}\}$.

In terms of TIP, $rd_{tip}$ is a projection of message body or header values to
a unary output relation. For instance, the receiver configuration keys $recvId_1$ and
$recvId_2$ have to be part of the message body like $body(x, 'recvId_1').\ body(x, 'recvId_2').$
Then $rd_{tip}$ would evaluate a Datalog rule similar to $config(y) \leftarrow body(x, y)$,
while the keys $recvId_1$ and $recvId_2$ correspond to receiver configurations $\{recv_1, recv_2\}$.
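
A minimal sketch of the receiver determination follows, assuming a message is a dictionary with a `body` relation of `(x, key)` tuples and that receiver configurations are held in a plain dictionary. The endpoint names and the function `recipient_list` are hypothetical.

```python
def rd_tip(body):
    """config(y) <- body(x, y): project the body relation to a unary relation."""
    return {(y,) for (_x, y) in body}


def recipient_list(msg_in, receiver_configs):
    """Look up a configuration per key and pair it with a copy of the message.

    Keys without a known configuration are skipped, so there can be more
    keys than resolved receivers.
    """
    keys = sorted(key for (key,) in rd_tip(msg_in["body"]))
    return [(receiver_configs[k], dict(msg_in)) for k in keys if k in receiver_configs]
```

The content-level part is only `rd_tip`; the lookup and the copying of the message belong to the system level and would stay in the host language.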


Splitter / Aggregator The antipodal Splitter and Aggregator patterns both have a
channel cardinality of 1:1 and create new, leaving messages. The splitter
breaks the entering message into multiple (smaller) messages (i.e., message
cardinality of 1:n) and the aggregator combines multiple entering messages
into one leaving message (i.e., message cardinality of n:1). The stateless
splitter uses a split condition $sc$ on the content-level, with $\{out_1, out_2, \ldots, out_n\} :=
sc(msg_{in}, conds)$, which accesses the entering message’s body to determine a list
of distinct body parts $\{out_1, out_2, \ldots, out_n\}$, based on a list of conditions $conds$.
Each part is inserted into one of a list of individual, newly created, leaving messages
$\{msg_{out_1}, msg_{out_2}, \ldots, msg_{out_n}\}$ with $n \in \mathbb{N}$ by a splitter function. The header
and attachments are copied from the entering to each leaving message.

The re-defined split condition $sc_{tip}$ evaluates a set of Datalog rules as $conds$
(i.e., mostly selection, and sometimes built-in and join constructs; the latter two
are marked “light blue”). Each part of the body $out_i$ with $i \in \mathbb{N}$ is a set of facts
that is passed to a split function, which wraps each set into a single message.
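
The division of work between split condition and split function can be sketched as below. The sketch assumes dictionary-shaped messages and Python predicates in place of Datalog rules; skipping empty parts is a choice of this sketch, not something the definition prescribes.

```python
def sc_tip(body, conds):
    """Evaluate each split condition; every non-empty result is one body part."""
    parts = []
    for cond in conds:
        part = {fact for fact in body if cond(fact)}
        if part:
            parts.append(part)
    return parts


def split(msg_in, conds):
    """Wrap each body part into a new message; the header is copied to each."""
    return [
        {"header": dict(msg_in["header"]), "body": part}
        for part in sc_tip(msg_in["body"], conds)
    ]
```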

The stateful aggregator defines a correlation condition, a completion condition
and an aggregation strategy. The correlation condition $crc$, with $coll_i :=
crc(msg_{in}, conds)$, determines the aggregate collection $coll_i$, in which the message
is stored, based on a list of conditions $conds$. The completion condition
$cpc$, with $cp_{out} := cpc(msg_{in}, [header.y \mid body.x])$, evaluates to a Boolean output
$cp_{out}$ based on header or body field information (similar to the message
filter). If $cp_{out}$ equals true, then the aggregation strategy $as$, with $agg_{out} :=
as(msg_{in}^{1}, msg_{in}^{2}, \ldots, msg_{in}^{n})$, is called by an implementation of the messaging
system and executed, else the current message is added to the collection $coll_i$.
The $as$ evaluates the correlated entering messages $coll_i$ and emits a new message
$msg_{out}$. For that, the messaging system has to implement an aggregation function
that takes $agg_{out}$ (i.e., the output of $as$) as input.

These functions are re-defined as $crc_{tip}$ and $cpc_{tip}$ such that the $conds$ are
Datalog rules mainly with selection and built-in constructs. The $cpc_{tip}$ makes use
of the defined $bool_{rc}$ function to map its evaluation result (i.e., list of facts or
empty) to the Boolean value $cp_{out}$. The aggregation strategy $as$ is re-defined as
$as_{tip}$, which mainly uses union to combine lists of facts from different messages
to one. The message format remains the same. To transform the aggregates’
formats, a message translator is used to keep the patterns modular. However, the
combination of the aggregation strategy with translation capabilities could lead
to runtime optimizations.
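
The three aggregator parts can be sketched together as a small stateful class. This is an illustration under assumptions: messages are dictionaries, the correlation and completion conditions are Python callables standing in for Datalog rules, and the completing message is included in the aggregate, which is one possible reading of the definition above.

```python
class Aggregator:
    """Stateful aggregator: correlation, completion, union-based strategy."""

    def __init__(self, correlation_key, completion_cond):
        self._crc = correlation_key   # stands in for crc_tip
        self._cpc = completion_cond   # stands in for cpc_tip
        self._collections = {}        # correlation key -> list of messages

    def on_message(self, msg_in):
        """Store the message; emit the aggregate once the collection is complete."""
        key = self._crc(msg_in)
        coll = self._collections.setdefault(key, [])
        coll.append(msg_in)
        # bool_rc: a non-empty completion result means "complete".
        if len(self._cpc(msg_in)) > 0:
            del self._collections[key]
            return as_tip(coll)
        return None


def as_tip(messages):
    """Aggregation strategy: union of the facts of all correlated messages."""
    body = set()
    for msg in messages:
        body |= msg["body"]
    return {"header": dict(messages[-1]["header"]), "body": body}
```

Only the collection bookkeeping is stateful; the conditions and the union strategy remain pure functions over facts.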
3.3 Message Transformation Patterns


The transformation patterns exclusively target the content of the messages in
terms of format conversions and content modifications.

The stateless Message Translator changes the structure or format of the
entering message without generating a new one (i.e., channel, message cardinality
1:1). For that, the translator computes the transformed structure by evaluating
a mapping program $mt$ (e.g., Datalog rules in Suppl. Material, List. 1.2), with
$msg_{out}.body := mt(msg_{in}.body)$. Thereby the field content can be altered. The
related Content Filter and Content Enricher patterns can be subsumed by
the general Content Modifier pattern and share the same characteristics as the
translator pattern. The filter evaluates a filter function $mt$, which only filters
out parts of the message structure (e.g., fields or values), and the enricher adds
new fields or values as data to the existing content structure using an enricher
program $ep$, with $msg_{out}.body := ep(msg_{in}.body, data)$.

The re-definition of the transformation function $mt_{tip}$ for the message translator
mainly uses join and projection (plus built-in for numerical calculations
and string operations, thus marked “light blue”) and selection, projection
and built-in (mainly numerical expressions and character operations) for the
content filter. While projections allow for rather static, structural filtering, the
built-in and selection operators can be used to filter more dynamically based
on the content. The resulting Datalog programs are passed as $msg_{out}.body$. In
addition, the re-defined enricher program $ep_{tip}$ mainly uses union operations to
add additional data to the message as Datalog programs.
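
The three transformation patterns reduce to short table operations. The following sketch assumes order facts of the hypothetical shape `(id, otype, orderkey, custkey, price)`; the column choice and the function `content_filter` are illustrative only.

```python
def mt_tip(orders):
    """Translator: project each order fact to the target format."""
    return {
        (oid, otype, orderkey, custkey)
        for (oid, otype, orderkey, custkey, _price) in orders
    }


def content_filter(orders, min_price):
    """Content filter: selection with a built-in predicate, then projection."""
    return {
        (oid, price)
        for (oid, _otype, _orderkey, _custkey, price) in orders
        if price >= min_price
    }


def ep_tip(body, data):
    """Content enricher: union of the message facts with additional facts."""
    return body | data
```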


3.4 Pattern Composition 

Since the TIP definitions target the content-level, all patterns can still be
composed to more complex integration programs (i.e., integration scenarios
or pipelines). From the many combinations of patterns, we briefly discuss two
important structural patterns that are frequently used in integration scenarios:
(1) scatter/gather and (2) splitter/gather [9]. The scatter/gather pattern (with
a 1:n:1 channel cardinality) is a multicast or recipient list that copies messages
to several, statically or dynamically determined pipeline configurations, which
each evaluate a sequence of patterns on the messages in parallel. Through an
aggregator pattern, the messages are structurally and content-wise joined. The
splitter/gather pattern (with a 1:n:1 message cardinality) splits one message
into multiple parts, which can be processed in parallel. In contrast to the
scatter/gather, the pattern sequence is the same for each instance. A subsequently
configured aggregator combines the messages to one.
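
As an illustration of the splitter/gather composition, the sketch below splits a body by a key, processes the parts in parallel with the same function, and unions the results. The use of a thread pool and the function name `splitter_gather` are assumptions of this sketch and do not describe how an integration system schedules its pipelines.

```python
from concurrent.futures import ThreadPoolExecutor


def splitter_gather(body, split_key, process):
    """Split a body by key, process the parts in parallel, union the results."""
    parts = {}
    for fact in body:
        parts.setdefault(split_key(fact), set()).add(fact)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(process, parts.values()))
    gathered = set()
    for result in results:
        gathered |= result
    return gathered
```

A scatter/gather variant would differ only in that each copy of the message is passed to a different processing function.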

4 Experimental Evaluation 

As System under Test (SuT) for an experimental evaluation we used the open
source, Java-based Apache Camel integration system [3] in version 2.14.0, which
implements most of the EIPs. The Camel system allows content-level extensions
through several interfaces, with which the TIP definitions were implemented and
embedded (e.g., own Camel Expression definitions for existing patterns, and
Camel Processor definitions for custom or non-supported patterns). The Datalog
system we used for the measurements is a Java-based, standard naive-recursive
Datalog processor (i.e., without stratification) in version 0.0.6.

Subsequently, the basic setup and execution of the measurements are introduced.
For brevity, a more detailed description of the setup is
provided in the Suppl. Material, Sect. A.1, the routing condition and mapping
programs are shown in Sect. A.2, the integration scenarios in Sect. A.3, and the
more detailed results in Sect. A.4.

4.1 Setup 

In the absence of an EIP benchmark, which we are currently developing on the
basis of this paper, we used Apache JMeter in version 2.12 as a load generator
client that sends messages to the SuT. We implemented a JMeter Sampler, which
allows injecting messages directly into the integration pipeline via a Camel direct
endpoint / adapter. For the throughput measurements, we used the JMeter jp@gc
transactions per second listener plugin from the standard package.

To measure the message throughput in a “data-intensive” (cf. O1), non-reliable
integration scenario, we use the standard TPC-H order, customer and
nation data sets. We added additional, unique message identifier and type fields
and translated the single records to JSON objects (cf. O3), each representing the
payload of a single message (i.e., “single-record” table message). In this way we
generated 1.5 million order-only messages (i.e., TPC-H scale level 1) and the
same amount of “multi-format” customer / nation messages, consisting of one
customer and all 25 nation records per message (in the “single-record” table
message case). During the measurements these messages are streamed to the
Camel endpoint and serialized either to Java Objects for the Camel-Java case or to
the ONC representation for the Camel-Datalog case (recall ONC iterators as
canonical data model).

All measurements are conducted on a HP Z600 work station, equipped with 
two Intel processors clocked at 2.67GHz with a total of 12 cores, 24GB of main 
memory, running a 64-bit Windows 7 SP1 and a JDK version 1.7.0. The JMeter 
Sampler and the integration system pipeline JVM process get 5GB heap space. 

4.2 “Single-record” / “Multi-Format” Table Message Processing 

Instead of testing all discussed patterns, we focus on the identified table operations
(e.g., selection / built-in, projection, join) and show the respective evaluation
by example of a representative pattern (cf. Fig. 2). The measurements for
selection and projection use the TPC-H Order-based, approximately 4kB messages
(i.e., 1.5 million order messages). The union operation (e.g., aggregation
strategy, content enricher) is not tested.


Apache JMeter, visited 02/2015: http://jmeter.apache.org/ 
We measured the selection / built-in operations in a content-based router
scenario with a routing condition tip_rc (cf. List. 1.1), which routes the order
message to its receiver based on conds for {string equality, integer less than}
on fields {objecttype, ototalprice}. The $bool_{rc}$ function is implemented in Java
to pass the expected value to the runtime system on system-level. The corresponding
“hand-coded” content-level Camel-Java implementation uses JSON
path statements for O(1) element access and conducts the type-specific condition
evaluation. The routing condition is defined to route 904,500 of the 1.5 million
messages to the first and the rest to the second receiver. Similarly, the projection
operation is measured using a message translator. The translator projects the
fields of the incoming order message to a target format (cf. List. 1.2) using an
$mt_{tip}$ implementation or a “hand-coded” projection on the Java Object representation.
Now, the “multi-format” customer messages (cf. O1) with nation records
as processing context are used to measure a routing condition with selection /
built-in and join operations (cf. List. 1.3). The customer message is routed if and
only if the customer’s balance (ACCTBAL) is bigger than 3,000 and the customer
is from the European region, determined through a join via the nation key.


Listing 1.1. Routing condition: tip_rc

cbr-order(id, -, OTOTALPRICE, -) :-
    order(id, otype, -,
          OTOTALPRICE, -, OPRIORITY, -),
    =(OPRIORITY, "1-URGENT"),
    >(OTOTALPRICE, 100000.00).


Listing 1.2. Message translation program: mt_tip

conv-order(id, otype,
           ORDERKEY, CUSTKEY, SHIPPRIORITY) :-
    order(id, otype, ORDERKEY,
          CUSTKEY, -, SHIPPRIORITY, -).


Listing 1.3. Routing condition with join over “multi-format” message

cbr-cust(CUSTKEY, -) :-
    customer(cid, ctype, CUSTKEY,
             CNATIONKEY, -, ACCTBAL, -),
    nation(nid, ntype, NATIONKEY,
           NREGIONKEY, -),
    >(ACCTBAL, 3000.0),
    =(CNATIONKEY, NATIONKEY),
    =(NREGIONKEY, 3).
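
For readers less familiar with Datalog, the join condition of Listing 1.3 can be read as a nested loop over customer and nation facts. The following sketch assumes simplified tuple layouts (the wildcard columns are dropped) and returns a unary result; it illustrates the rule's meaning and is not the Datalog engine's evaluation.

```python
def cbr_cust(customers, nations):
    """Derive the customer keys with balance above 3000.0 in region 3.

    customers: (cid, ctype, custkey, cnationkey, acctbal) tuples
    nations:   (nid, ntype, nationkey, nregionkey) tuples
    """
    return {
        (custkey,)
        for (_cid, _ctype, custkey, cnationkey, acctbal) in customers
        for (_nid, _ntype, nationkey, nregionkey) in nations
        if acctbal > 3000.0 and cnationkey == nationkey and nregionkey == 3
    }
```

A non-empty result corresponds to the message being routed; an empty result means the condition does not hold.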


Table 1. Throughput measurements for format conversion, message routing and
transformation patterns based on 4kB messages generated from 1.5 million
standard TPC-H orders records.

                Format Conv.  Content-based Routing   Content-based Routing   Message Transformation
                                                      (Join)
                (msec)        Time (sec)  Mean (tps)  Time (sec)  Mean (tps)  Time (sec)  Mean (tps)

Camel-Datalog   7,239.60      108         13,761.47   126         11,904.76   103         14,423.08
                +/-152.69                 +/-340.08               +/-261.20               +/-228.74

Camel-Java      6,648.50      115         12,931.03   117         12,633.26   107         13,888.89
                +/-143.55                 +/-304.90               +/-176.89               +/-247.40

Datalog-Bulk    7,239.60      12          122,467.58  13          116,780.00  11          133,053.20
(size=10)       +/-152.69                 +/-2,532.42             +/-1,714.92             +/-1,645.39

The throughput test streams all 1.5 million order / customer messages to
the pipeline. The performance measurement results are depicted in Table 1 for a
single thread execution. Measurements with multiple threads show a scaling up
to factor 10 of the results, with a saturation around 36 threads (i.e., factor of
number of cores; not shown). The stream conversion to JSON objects, aggregated
for all messages, is slightly faster than for ONC. However, in both order message
cases the TIP-based implementation reaches a slightly higher transactions per
second rate (tps), which lets the processing end 7 s and 4 s earlier respectively,
due to the natural processing of ONC iterators in the Datalog engine. Although
the measured 99% confidence intervals do not overlap, the execution times are
similar. The rather theoretical case of increasing the number of selection / built-in
operations on the order messages (e.g., date before / after, string contains)
showed a stronger impact for the Camel-Java case than the Camel-Datalog case
(not shown). In general, the Camel-Java implementation concludes with a routing
decision as soon as a logical conjunction is found, while the conjunctive Datalog
implementation currently evaluates all conditions before returning. In the context
of integration operations this is not necessary and could be improved by adapting
the Datalog evaluation accordingly, which we could show experimentally (not shown;
out of scope for this paper). The measured throughput of the content-based
router with join processing on the 1.5 million “multi-format” TPC-H customer
/ nation messages again shows similar results. Only this time, the too simple
NestedLoopJoin implementation in the used Datalog engine causes a loss of 9
seconds compared to the “hand-coded” JSON join implementation.
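
The difference between evaluating all conditions and deciding early can be illustrated with a small sketch. It assumes one plausible reading of the paragraph above, namely that the hand-coded variant stops at the first failing conjunct; both function names are hypothetical.

```python
def evaluate_all(fact, conds):
    """Evaluate every condition before deciding (no early exit)."""
    results = [cond(fact) for cond in conds]
    return all(results)


def evaluate_early_exit(fact, conds):
    """Stop at the first condition that fails."""
    for cond in conds:
        if not cond(fact):
            return False
    return True
```

Both variants return the same decision; they differ only in how many conditions are touched, which matters when conditions are expensive or numerous.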

4.3 Outlook: “Multi-record” Table Message Processing 

The discussed measurements assume that a message has a “single-record” payload,
which results in 1.5 million messages with one record / message identifier each.
So far, the JSON to ONC conversion creates ONC collections with only one table
iterator (to conform with EIP semantics). However, the nature of our approach
allows us to send ONC collections with several entries (each representing a unique
message payload with message identifier). Knowing that this would change the
semantics of several patterns (e.g., the content-based router), we conducted the
same test as before with “multi-record” table messages of bulk size 10, which
reduces the required runtime to 12 s for the router and 11 s for the translator,
which can still be used with its original definition (cf. Table 1). Increasing the
bulk size to 100 or even 1,000 reduces the required time to 1 s, which means that
all 1.5 million messages can be processed in one step in a single thread. Here,
increasing the bulk size means reducing the number of message collections while
increasing the number of rows per collection. The gain is due to the efficient
table-centric Datalog evaluation on fewer, multi-row message collections. The
higher throughput comes at the cost of a higher latency. The join performance
issue noted above can be seen in the Datalog-bulk case as well, which required
13 steps / seconds to process the 1.5 million customer / nation messages.
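
The relation between bulk size and the number of message collections can be sketched as follows. This is an illustrative Java fragment only, not the ONC implementation: `BulkDemo` and its methods are hypothetical names, and plain Java lists stand in for ONC collections.

```java
import java.util.ArrayList;
import java.util.List;

final class BulkDemo {
    /** Splits the messages into consecutive collections of at most bulkSize entries. */
    static <T> List<List<T>> toBulks(List<T> messages, int bulkSize) {
        if (bulkSize < 1) {
            throw new IllegalArgumentException("bulkSize must be >= 1");
        }
        List<List<T>> bulks = new ArrayList<>();
        for (int i = 0; i < messages.size(); i += bulkSize) {
            int end = Math.min(i + bulkSize, messages.size());
            bulks.add(messages.subList(i, end));
        }
        return bulks;
    }

    /** Number of collections needed for a given message count (ceiling division). */
    static long collections(long messages, int bulkSize) {
        return (messages + bulkSize - 1) / bulkSize;
    }

    public static void main(String[] args) {
        long total = 1_500_000L;
        for (int size : new int[] {1, 10, 100, 1_000}) {
            System.out.println("bulk size " + size + ": "
                + collections(total, size) + " collections");
        }
    }
}
```

Each collection is handed to the pattern as one unit, so a receiver sees the first record of a collection only after the whole collection has been processed, which is the latency cost mentioned above.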

5 Related Work 

The application of table-centric operators to current integration systems has not
been considered before, to the best of our knowledge, and was only recently
introduced by our position paper [?], which discusses the expressiveness of
table-centric / logic programming for integration processing on the content level.

The work on Java systems like Telegraph Dataflow [?] and Jaguar [?] can
be considered related work in the area of programming languages on application
systems for faster, data-aware processing. These approaches mainly aim to make
Java more capable of data processing, dealing primarily with threading,
garbage collection and memory management. None of them considers the
combination of the host language with table-centric processing.

Declarative XML Processing Related work can be found in the area of declarative
XML message processing (e.g., [?]). Using an XQuery data store for defining
persistent message queues (i.e., conflicting with O3), the work targets a
complementary subset of our approach (i.e., persistent message queuing).

Data Integration The data integration domain uses integration systems for 
querying remote data that is treated as local or “virtual” relations. Starting with 
SQL-based approaches (e.g., using Garlic [8]), the data integration research 
reached relational logic programming, summarized by [6]. In contrast to such 
remote queries, we define a table-centric, integration programming approach for 
application integration, while keeping the current semantics (for now). 

Data-Intensive and Scientific Workflow Management Based on the data patterns
in workflow systems described by Russell et al. [14], modeling and data access
approaches have been studied (e.g., by Reimann et al. [?]) in simulation workflows.
The basic data management patterns in simulation workflows are ETL operations
(e.g., format conversions, filters), which are a subset of the EIP and can be
represented, among others, by our approach. The (map/reduce-style) data iteration
pattern can be represented by combined EIPs like scatter/gather or splitter/gather.

Similar to our approach, data and control flow have been considered in
scientific workflow management systems [?], which run the integration system
optimally synchronized with the database. However, the work exclusively focuses
on the optimization of workflow execution, not integration systems, and does not
consider the usage of table-centric programming on the application server level.

6 Concluding Remarks 

This paper motivates a look into a growing “processing discrepancy” (e.g.,
message throughput) between current integration and complementary systems
(e.g., MQ, RDBMS) based on known scenarios with new requirements and fast
growing new domains (O1-O3). Towards a message throughput improvement,
we extended the current integration processing on a content level by table-centric
integration processing (TIP). To remain compliant with the current EIP definitions,
the TIP operators work on “single-record” messages, which lets us compare them with
current approaches in a brief experimental throughput evaluation. Although
the results slightly improve on the standard processing, not to mention the benefit
of a declarative over a “hand-coded” definition of integration content, the actual
potential of our approach lies in “multi-record” table message processing. However,
that requires an adaptation of some EIP pattern definitions, which is out of scope
for this paper.

Open Research Challenges For a more comprehensive experimental evaluation, an
EIP micro-benchmark will be developed on an extension of the TPC-H and TPC-C
benchmarks. EIP definitions do not discuss streaming patterns / operators,
which could be evaluated (complementarily) based on Datalog streaming theory
(e.g., [?]). Eventually, the existing EIP definitions have to be adapted
accordingly, and new patterns will probably be established. Notably, the used
Datalog engine has to be improved (e.g., join evaluation) and enhanced for
integration processing (e.g., for early-match / stop during routing).

A Supplementary Material 

In this section supplementary material is listed and briefly explained. The
material provides further details on the described table-operator content / Datalog
rules used for the experimental evaluation and can be seen as examples for the
TIP definitions. In addition, the Apache Camel routes and the results of the
experimental evaluation are shared for comparison and traceability.

The supplementary material is provided inline for better readability and can 
be externalized to a separate document, if required. 


A.1 Details on the Test Setup 

During the execution of our micro-benchmark in Java, we used the JMH code
tool from OpenJDK¹ to mitigate the measurement distortions caused by the JVM's
“profile-guided optimization”.

Data and Message Generation The message generation is a “two-step” approach
consisting of a data generation and a message generation step. The data generation
step uses the standard TPC-H generator on scale level one, from which we took
the generated order records (i.e., CSV format), converted them to the JSON
format, added additional message identifier and type fields, and stored them
in one big JSON array. Each JSON object represents the payload of a single
message (i.e., “single-record” message). During the test preparation, Camel
message payloads are constructed as streams of these JSON objects and sent
into the pipeline. In each individual test, the processors in the pipeline have to
convert the JSON stream using the (StAX-like) Jackson Stream API² to either a
Java or ONC representation (cf. canonical data model; this required an extension
of Jackson for ONC), before applying the specific pattern to test. That means the
conversions are part of the measurements, but are extracted to a separate
performance figure. We decided to use the TPC-H data generator due to its
applicability to our routing and transformation tests and since it is well known
in our domain.
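
The message generation step described above could look roughly like the following Java sketch. It is an assumption-laden illustration rather than the generator used for the experiments: the class `MessageGenDemo`, the field name `msgid` and the type value `ORDER` are hypothetical, the column names follow the TPC-H ORDERS schema, the delimiter is passed in because it depends on how the generator output was exported, and all values are emitted as JSON strings for simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class MessageGenDemo {
    // Columns of the TPC-H ORDERS table in schema order.
    static final String[] COLS = {
        "O_ORDERKEY", "O_CUSTKEY", "O_ORDERSTATUS", "O_TOTALPRICE", "O_ORDERDATE",
        "O_ORDERPRIORITY", "O_CLERK", "O_SHIPPRIORITY", "O_COMMENT"};

    /** Wraps a value in double quotes and escapes backslashes and quotes. */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Converts one delimited order record into a JSON object with id and type fields. */
    static String toJsonMessage(long msgId, String line, String delimiter) {
        String[] v = line.split(Pattern.quote(delimiter), -1);
        if (v.length < COLS.length) {
            throw new IllegalArgumentException("too few columns: " + line);
        }
        StringBuilder sb = new StringBuilder("{");
        sb.append("\"msgid\":").append(msgId);     // hypothetical identifier field
        sb.append(",\"objecttype\":\"ORDER\"");     // type value is an assumption
        for (int i = 0; i < COLS.length; i++) {
            sb.append(',').append(quote(COLS[i])).append(':').append(quote(v[i]));
        }
        return sb.append('}').toString();
    }

    /** Stores all single-record messages in one JSON array. */
    static String toJsonArray(List<String> lines, String delimiter) {
        List<String> objects = new ArrayList<>();
        long id = 0;
        for (String line : lines) {
            objects.add(toJsonMessage(id++, line, delimiter));
        }
        return "[" + String.join(",", objects) + "]";
    }

    public static void main(String[] args) {
        String line = "1|370|O|172799.49|1996-01-02|5-LOW|Clerk#000000951|0|sample comment";
        System.out.println(toJsonArray(List.of(line), "|"));
    }
}
```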


Message Routing and Transformation The selection / built-in operations are
mainly used during message routing (e.g., content-based router, message filter).
For the evaluation, we have defined a routing condition (cf. List. 1.1), which routes
the message to its receiver if and only if two conditions are fulfilled for routing
condition tip_rc: conds = {string equality, integer less than} on fields body.x =
{objecttype, ototalprice}. The help_rc function is implemented in Java to pass
the expected value to the runtime system on system level. The corresponding
pure-Java, content-level implementation uses JSON path statements for O(1)
element access and conducts the type-specific condition evaluation “hand-coded”.


1 OpenJDK JMH, visited 02/2015: http://openjdk.java.net/projects/code-tools/jmh/

2 Jackson Stream API, visited 02/2015: http://wiki.fasterxml.com/JacksonStreamingApi
The routing condition is defined to route 904,567 of the 1.5 million messages.
The projection operation is mostly used in message transformation patterns (e.g.,
message translator, content modifier), from which we selected a message translator.
The translator projects the fields of the incoming message to a target format (cf.
List. 1.2) using our mt_tip implementation or a “hand-coded” projection on the
Java Object representation. One requirement from the mentioned scenarios (O1)
is the support of multi-format, multi-record messages. This type of message has
one “primary” message body and (several) dependent sub-entries, usually in a
different format. The sub-entries act as “temporary” processing context (e.g.,
added by a content enricher), which can be removed before forwarding the message
to its receiver. For the evaluation, we use the join together with selection /
built-in operations for a content-based router on a TPC-H customer message as a
JSON object embedded in a JSON array with all entries of the nation table. The
message is routed if and only if the customer's account balance (ACCTBAL) is
greater than 3,000 and the customer is from the European region, determined
through a join via the nation key (cf. Suppl. Material in Sect. A.2).
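
The join-based routing condition can be sketched in Java as follows. The nested-loop variant mirrors the simple join strategy discussed in the evaluation; the hash-index variant is one common alternative and is shown only for comparison, not as the improvement the authors have in mind. The class `JoinRouterDemo` and the record types are hypothetical, and Java 17 or later is assumed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class JoinRouterDemo {
    record Customer(long custKey, int nationKey, double acctBal) {}
    record Nation(int nationKey, int regionKey) {}

    static final int EUROPE = 3;            // region key used in the text
    static final double MIN_BAL = 3_000.0;  // balance threshold used in the text

    /** Nested-loop variant: scans the nation entries for each customer message. */
    static boolean routeNestedLoop(Customer c, List<Nation> nations) {
        if (c.acctBal() <= MIN_BAL) {
            return false;
        }
        for (Nation n : nations) {
            if (n.nationKey() == c.nationKey() && n.regionKey() == EUROPE) {
                return true;
            }
        }
        return false;
    }

    /** Builds a nation-key index so that each lookup is a single hash probe. */
    static Map<Integer, Integer> regionByNation(List<Nation> nations) {
        Map<Integer, Integer> index = new HashMap<>();
        for (Nation n : nations) {
            index.put(n.nationKey(), n.regionKey());
        }
        return index;
    }

    /** Hash-index variant of the same routing condition. */
    static boolean routeHashed(Customer c, Map<Integer, Integer> index) {
        Integer region = index.get(c.nationKey());
        return c.acctBal() > MIN_BAL && region != null && region == EUROPE;
    }

    public static void main(String[] args) {
        // Three TPC-H nations: BRAZIL (region 1), FRANCE and GERMANY (region 3).
        List<Nation> nations = List.of(new Nation(2, 1), new Nation(6, 3), new Nation(7, 3));
        Customer c = new Customer(1L, 7, 5_000.0);
        System.out.println(routeNestedLoop(c, nations));
        System.out.println(routeHashed(c, regionByNation(nations)));
    }
}
```

Whether an index pays off depends on how often it can be reused: if every message carries its own copy of the small nation table, building the index per message may cost more than the scan it replaces.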


A.2 Datalog Rules 

The Datalog rules used for the table-centric integration processing on the TPC-H
order, customer and nation messages during the experimental analysis are listed
subsequently.

Listing 1.1 denotes the rule for the content-based routing pattern on a TPC-H
order message, which is used to route orders to the first recipient if and only
if the order is urgent (OPRIORITY=='1-URGENT') and costly (OTOTALPRICE>
100,000.00). In the message translation evaluation, order messages are translated
to a target format with only the primary and foreign key fields and the shipment
priority according to the rule in List. 1.2. The extended routing condition in
List. 1.3 is used to route TPC-H customer record messages with the complete
list of TPC-H nation records if and only if the account balance is above average
(ACCTBAL>3,000) and the customer is from the Europe region (NREGIONKEY==3).
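
For comparison with the declarative rules, the order routing condition and the message translation can be written as a “hand-coded” Java sketch. This is not the authors' `CbrTpchExp` or `MtTpchProc` code: the class `OrderPatternsDemo` is hypothetical, a plain map stands in for the message body, the field names follow the TPC-H schema (with underscores) and may differ from those in the actual messages, and Java 17 or later is assumed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OrderPatternsDemo {
    /** Router condition: the order is urgent and its total price exceeds 100,000. */
    static boolean isUrgentAndCostly(Map<String, Object> order) {
        Object priority = order.get("O_ORDERPRIORITY");
        Object price = order.get("O_TOTALPRICE");
        return "1-URGENT".equals(priority)
            && price instanceof Number p
            && p.doubleValue() > 100_000.00;
    }

    /** Translator: keeps only the primary key, the foreign key and the shipment priority. */
    static Map<String, Object> translate(Map<String, Object> order) {
        Map<String, Object> target = new LinkedHashMap<>();
        for (String field : new String[] {"O_ORDERKEY", "O_CUSTKEY", "O_SHIPPRIORITY"}) {
            target.put(field, order.get(field));
        }
        return target;
    }

    public static void main(String[] args) {
        Map<String, Object> order = new LinkedHashMap<>();
        order.put("O_ORDERKEY", 1L);
        order.put("O_CUSTKEY", 370L);
        order.put("O_TOTALPRICE", 172_799.49);
        order.put("O_ORDERPRIORITY", "1-URGENT");
        order.put("O_SHIPPRIORITY", 0);
        order.put("O_COMMENT", "sample comment");
        System.out.println(isUrgentAndCostly(order));
        System.out.println(translate(order));
    }
}
```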


A.3 Integration Scenarios: Apache Camel Routes 


In this section, the integration scenarios used for the experimental analysis are
described in more detail.


Listing 1.4 and List. 1.5 denote the Camel routes / message channels for the
content-based routing cases. The direct-endpoint is used to receive a stream
of messages in JSON format, which are then unmarshalled into a Java Object or
ONC-iterator form, the canonical data model (CDM). On the CDM, subsequent
operations can be defined and executed. The Camel choice defines a conditional
routing based on a Camel expression evaluating the message's body / data. The
expression is one of the extension points in Camel to which our TIP operations
can be applied (e.g., rc_tip). In the standard Camel-Java case a “hand-coded”
Java expression, new CbrTpchExp(), is evaluated. If the expressions evaluate to
false, the messages are processed in the otherwise channel. The same routes
are used for the normal and the multi-format router cases; only the expressions
are exchanged.

Another Camel interface that allows the definition of own content-level
extensions is the processor. To evaluate the message translation capabilities, we
applied our TIP operators to the Camel processor to access and change the ONC
messages. In the Camel-Java case a “hand-coded” processor is implemented as
new MtTpchProc(). Both processors receive a message and translate its content,
before handing it back to the system-level processing.


Listing 1.4. Camel-Datalog route for routing according to rc_tip

    from("direct:datalog-cbr")
      .unmarshal()
      .choice().when()
        .expression(bool_rc(rc_tip(<conds>))).process(VOID)
      .otherwise().process(VOID);

Listing 1.5. Camel-Java route for routing

    from("direct:java-cbr")
      .unmarshal()
      .choice().when()
        .expression(new CbrTpchExp())
        .process(VOID)
      .otherwise().process(VOID);

Listing 1.6. Camel-Datalog message translation executing a mt_tip program

    from("direct:datalog-mt")
      .unmarshal()
      .process(mt_tip(<program>))
      .process(VOID);

Listing 1.7. Camel-Java message translation

    from("direct:java-mt")
      .unmarshal()
      .process(new MtTpchProc())
      .process(VOID);


A.4 Experimental Results 

In this section the plotted results of the experimental analysis are shown. Since all
tests are based on 1.5 million TPC-H order or customer / nation (“multi-format”)
messages, the required processing time is represented by the x-axis in steps of
one second each. Hence, the earlier the samples stop, the higher the average
throughput of the route, which is depicted on the y-axis. All measurements were
conducted with a single Camel route thread. The first three plots show the two
content router cases and the one message translator case. The fourth one plots the
results of the three cases with “multi-record” table message processing of bulk
size ten. Hence, each of the depicted discrete throughput measures does not
represent messages per second, but collections of ten messages per second
(i.e., it has to be multiplied by a factor of 10).

The results of the first three experiments show that the TIP approach is
competitive with “state-of-the-art” message processing in integration systems. The
routing conditions with selection / built-in operations are better and the
nested-loop join implementation is slightly worse than the “hand-coded” and
optimized Java counterpart. Hence, an enhanced join implementation will
significantly improve the results of the Camel-Datalog evaluation in Fig. 3(b).

(a) Content-based Router (Selection / Built-in: TPC-H Order)
(b) Content-based Router (Selection / Built-in, Join: TPC-H Customer with Nation)

Fig. 3. Plotted results of the experimental analysis.